A New Intelligence Based Approach for 
Computer-Aided Diagnosis of Dengue Fever 

Vadrevu Sree Hari Rao, Senior Member, IEEE, and Mallenahalli Naresh Kumar 


Abstract —Identification of the influential clinical symptoms 
and laboratory features that help in the diagnosis of dengue 
fever in the early phase of the illness would aid in designing effective 
public health management and virological surveillance strategies. 
With this as our main objective, we develop in this paper a new 
computational intelligence based methodology that predicts the 
diagnosis in real time, minimizing the number of false positives 
and false negatives. Our methodology consists of three major 
components: (i) a novel missing value imputation procedure that 
can be applied to any data set consisting of categorical (nominal) 
and/or numeric (real or integer) attributes; (ii) a wrapper based 
features selection method with genetic search for extracting a subset 
of the most influential symptoms that can diagnose the illness; and 
(iii) an alternating decision tree method that employs boosting for 
generating highly accurate decision rules. The predictive models 
developed using our methodology are found to be more accurate 
than the state-of-the-art methodologies used in the diagnosis of 
dengue fever. 

Index Terms —dengue fever, classification, clinical diagnosis, 
prediction, imputation, features selection, genetic search, 
alternating decision trees 


I. Introduction 

Dengue fever (DF) is a mosquito-borne infectious disease 
caused by viruses of the genus Flavivirus, family Flaviviridae. 
The disease is transmitted through the bites of vectors 
(Aedes aegypti, Aedes albopictus) carrying these viruses [1]. 
Since its first appearance in the Philippines in 1953, the disease 
has been identified as one of the most important arthropod-borne 
viral diseases in humans [2]. Dengue virus infection has been 
reported in more than 100 countries, with 2.5 billion people living 
in areas where dengue is endemic. The annual occurrence is estimated 
to be around 100 million cases of DF and 250,000 cases of 
dengue hemorrhagic fever (DHF). 

The diagnosis of dengue fever presents great challenges as 
the symptoms overlap with those of other febrile illnesses. Accurate 
diagnosis is possible only after conducting definitive tests such 
as enzyme-linked immunosorbent assays (ELISA) and real-time 
polymerase chain reaction (RT-PCR), which are based 
on nucleic acid hybridization [3]. A recent study [4] on 
the behavior of the C-type lectin domain family 5, member A 
(CLEC5A) gene may result in a strategy for reducing tissue 
damage, which would help improve the odds of survival of 
patients suffering from DHF and dengue shock syndrome 
(DSS). A multivariate model was developed in [5] for predicting 
hemoglobin (Hb) using predictors such as reactance 
obtained from a single frequency bioelectrical impedance 
analysis, sex, nausea/vomiting sensation and weight. These 
strategies can be employed only 2 to 12 days after 
the onset of the illness and require state-of-the-art laboratory 
facilities. 

The World Health Organization (WHO) has arrived at a 
classification scheme for identifying infected individuals 
based on clinical symptoms and laboratory features. The 
development of predictive models for diagnosis of dengue fever 
based on these schemes is affected by missing or incomplete 
data records in the clinical databases [6], which may arise for 
any or all of the following reasons: (i) a value being lost 
(erased or deleted), (ii) a value not being recorded, (iii) incorrect 
measurements, (iv) equipment errors and (v) an expert not attaching 
any importance to a particular clinical procedure. Usually the data 
are not collected from an organized research point of view [7]. The 
presence of a large number of clinical symptoms and laboratory 
features requires one to search large subspaces for optimal 
feature subsets. Unless addressed appropriately, these issues 
would hinder the development of an accurate and computationally 
effective diagnostic system. 

In view of the above challenges, the objectives of our work 
are: 

• to identify the missing values (MV) in the data set 
and impute them by using a newly developed 
imputation procedure; 

• to identify a set of clinical symptoms that would enable 
early detection of suspected dengue in children and 
adults, which reduces the risk of transmission of 
dengue fever in the community; 

• to identify the laboratory features and clinical symptoms 
that would enable better diagnosis and understanding of 
the disease in suspected dengue individuals. This renders 
optimal utilization of the laboratory resources required 
for confirmed diagnosis; 

• to build a predictive model that is capable of rendering 
effective diagnosis in real time. Further, we compare 
its performance with other state-of-the-art methods used 
in the diagnosis of dengue fever. 

The present paper is organized as follows. A survey of the 
state-of-the-art techniques for the diagnosis of dengue fever is 
presented in Section II, while in Section III we describe our 
novel methodology for computer-aided clinical diagnosis of 
dengue. The performance evaluation of the methodologies is 
described in Section IV. The description of the data sets and 
the experimental results are presented in Section V. We present 
a comparison of our new imputation methodology with other 
imputation methods in Section VI. In Section VII we discuss 
the computational complexity of our new method. Comparison 
of our new methodology with other state-of-the-art methods 
forms the subject of Section VIII. Conclusions and discussion 
are deferred to Section IX. 


II. Survey of the state-of-the-art techniques for diagnosis of dengue fever 


A logistic regression method was employed to identify clinical 
symptoms and laboratory features in 381 individuals, of whom 
148 were confirmed dengue cases [8]. The data records 
with missing values (MV) were ignored and deleted from 
the data set. In [9], the study was conducted on clinical records 
comprising 341 children and 597 adults, of whom 38 and 
107 respectively were laboratory-confirmed positive dengue 
cases. In this study the data fields that were incomplete or 
inaccurate for all suspected dengue cases were replaced with 
the known values corresponding to the information in the 
medical charts. A C4.5 decision tree, which has a built-in 
mechanism for handling MV, was employed in [10] to develop a 
diagnostic algorithm to differentiate dengue from non-dengue 
illness on a data set comprising 1200 patients, of whom 173 
had DF, 171 had DHF and 20 had DSS. A support vector 
machine (SVM) based methodology was employed in [11] 
to analyze the expression pattern of 12 genes of 28 dengue 
patients, of whom 13 were DHF and 15 were DF cases. A 
set of seven influential genes was identified through selective 
removal of expression data of these twelve genes. 


In the above studies the MV were either removed [8] or 
filled with approximate values based on medical charts [9]. 
These approaches would lead to biased estimates and may 
either reduce or exaggerate the statistical power. Methods such 
as logistic regression, maximum likelihood and expectation 
maximization have been employed for imputation of MV, but 
they can be applied only to data sets that are either nominal 
or numeric. There are other imputation methods, such as 
k-nearest neighbor imputation (KNNI) [12], k-means clustering 
imputation (KMI) [13], weighted k-nearest neighbor imputation 
(WKNNI) [14] and fuzzy k-means clustering imputation 
(FKMI) [13], that have been applied to other data sets but not 
to dengue fever data sets. Further, the authors in [8], [9], [11] 
have employed methods such as odds ratio (OR) and selective 
inclusion or exclusion of attributes for obtaining features 
subsets of dengue fever data sets. But these methods do not 
yield effective diagnosis, as not all interactions or correlations 
between the features and the diagnosis are considered in 
these studies. 


III. A new methodology for computer-aided diagnosis of dengue fever 

Motivated by the above issues, we propose a new methodology 
comprising a novel non-parametric missing value 
imputation method that can be applied to data sets consisting 
of attributes that are categorical (nominal) and/or 
numeric (integer or real). The methodology proposed in [15] 
ignores missing values while generating the decision tree, 
which results in lower prediction accuracies. We have embedded 
the new imputation strategy (Section III-B) before generating 
the alternating decision tree, which results in improved 
performance of the classifier on data sets having missing 
values. Also, we develop an effective wrapper based features 
selection algorithm in order to identify the most influential 
features subset. The present methodology consists of 
the new imputation embedded alternating decision tree 
and the wrapper based features subset selection algorithm. 
This methodology can predict the diagnosis of dengue in real 
time. In fact, the machine knowledge acquired by utilizing this 
novel methodology will be useful to diagnose other individuals 
based on clinical symptoms and laboratory features where 
the clinical decision is unavailable. We designate this novel 
methodology as NM throughout this work. 


A. Data representation 

A clinical data set can be represented as a set $S$ 
having row vectors $(R_1, R_2, \ldots, R_m)$ and column vectors 
$(C_1, C_2, \ldots, C_n)$. Each record can be represented as 
an ordered $n$-tuple of clinical and laboratory attributes 
$(A_{i1}, A_{i2}, \ldots, A_{i(n-1)}, A_{in})$ for each $i = 1, 2, \ldots, m$, 
where the last attribute $A_{in}$, for each $i$, represents the physician's 
diagnosis to which the record $(A_{i1}, A_{i2}, \ldots, A_{i(n-1)})$ belongs, 
and without loss of generality we assume that there are no 
missing elements in this set. Each attribute of an element in 
$S$, that is, $A_{ij}$ for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n-1$, can 
be either of categorical (nominal) or numeric (real or integer) 
type. Clearly all the sets considered are finite sets. 
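As a minimal sketch of this representation, a data set can be held as a list of tuples whose last entry is the diagnosis. The names `Record` and `missing_positions`, the use of `None` to mark a missing value, and the toy values are assumptions made for illustration and are not part of the original method.

```python
from typing import List, Optional, Tuple, Union

# None marks a missing value (an assumed convention for this sketch).
Value = Optional[Union[str, float, int]]
# A record is (A_i1, ..., A_i(n-1), A_in); the last entry is the diagnosis.
Record = Tuple[Value, ...]


def missing_positions(S: List[Record]) -> List[Tuple[int, int]]:
    """Return (row, column) index pairs of missing values.

    The decision attribute (last column) is skipped, since it is
    assumed to be always present.
    """
    return [
        (i, j)
        for i, record in enumerate(S)
        for j, value in enumerate(record[:-1])
        if value is None
    ]


# Toy data set with m = 4 records and n = 3 attributes (illustrative values).
S: List[Record] = [
    (None, 12.0, "positive"),
    ("yes", 10.5, "positive"),
    ("no", 14.0, "positive"),
    ("no", 13.0, "negative"),
]
print(missing_positions(S))
```

Keeping the diagnosis as the last tuple entry mirrors the ordered n-tuple notation used above, so column index `j` in the code corresponds to column $C_{j+1}$ in the text.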


B. A new non-parametric imputation strategy 

The first step in any imputation algorithm is to compute a 
proximity measure in the feature space between the clinical 
records to identify the nearest neighbors from which the values 
can be imputed. The most popular metric for quantifying the 
similarity between any two records is the Euclidean distance. 
Even though this metric is simple to compute, it is sensitive to 
the scales of the features involved. Further, it does not account 
for correlation between the features. Also, the categorical 
variables can only be quantified by counting measures, which 
calls for the development of effective strategies for computing 
the similarity [16]. Considering these factors, we first propose 
a new indexing measure $I_{C_l}(R_i, R_k)$ between two typical 
elements $R_i$, $R_k$, for $i, k = 1, 2, \ldots, m$ and $l = 1, 2, \ldots, n-1$, 
belonging to the column $C_l$ of $S$, which can be applied to any 
type of data, be it categorical (nominal) and/or numeric (real 
or integer). We consider the following cases: 

Case I: $A_{in} = A_{kn}$ 

Let $A$ denote the collection of all members of $S$ that 
belong to the same decision class to which $R_i$ and $R_k$ 
belong and do not have MV. Based on the type of the 
attribute to which the column $C_l$ belongs, the following 
situations arise: 

(i) Elements of the column $C_l$ of $S$ are of categorical 
(nominal) type: 

We now express $A$ as a disjoint union of non-empty 
subsets of $A$, say $B_{\gamma_{p_1}}, B_{\gamma_{p_2}}, \ldots, B_{\gamma_{p_s}}$, obtained 
in such a manner that every element of $A$ 
belongs to one of these subsets and no element 
of $A$ is a member of more than one subset of $A$. 
That is, $A = B_{\gamma_{p_1}} \cup B_{\gamma_{p_2}} \cup \cdots \cup B_{\gamma_{p_s}}$, in which 
$\gamma_{p_1}, \gamma_{p_2}, \ldots, \gamma_{p_s}$ denote the cardinalities of the respective 
subsets $B_{\gamma_{p_1}}, B_{\gamma_{p_2}}, \ldots, B_{\gamma_{p_s}}$ formed out of the 
set $A$, with the property that each member of the same 
subset has the same first co-ordinate and members of 
no two different subsets have the same first co-ordinate. 
We define an index 

$$I_{C_l}(R_i, R_k) = \begin{cases} \min\left\{\dfrac{\gamma_{p_i}}{[\ldots]}, \dfrac{\gamma_{p_k}}{[\ldots]}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise,} \end{cases}$$

where $\gamma_{p_i}$ represents the cardinality of the subset 
$B_{\gamma_{p_i}}$, all of whose elements have first co-ordinates 
$A_{il}$, and $\gamma_{p_k}$ represents the cardinality of that subset 
$B_{\gamma_{p_k}}$, all of whose elements have first co-ordinates 
$A_{kl}$. 


(ii) Elements of the column $C_l$ of $S$ are of numeric type: 
Numeric types can be classified further as integers 
(whole numbers) or reals (fractional numbers). If the 
attribute is of integer type then we follow the procedure 
discussed in Case I item (i). For fractional numbers we 
construct the index $I_{C_l}(R_i, R_k)$ based on the ratio of 
the values of the elements $A_{il}$, $A_{kl}$ of column $C_l$ to 
the mean of the set of elements belonging to $A$ that do 
not have MV. It is given by 

$$I_{C_l}(R_i, R_k) = \begin{cases} \min\left\{\dfrac{A_{il}}{\bar{A}}, \dfrac{A_{kl}}{\bar{A}}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise.} \end{cases}$$

In the above definition $\bar{A}$ denotes the average of the 
column entries of all the elements of the set $A$ 
excluding those with MV in the column. 
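The two ingredients of Case I can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, `None` is assumed to mark a missing value, and for the categorical case only the subset cardinalities are computed. The numeric values used in the checks are those of the illustrative example given later in this section.

```python
from collections import Counter
from statistics import mean


def value_cardinalities(column):
    """Cardinality of each subset of A whose members share a column value.

    Missing entries (None) are ignored. A value that never occurs has
    cardinality 0, because Counter returns 0 for unknown keys.
    """
    return Counter(v for v in column if v is not None)


def numeric_index_same_class(a_il, a_kl, column, same_record=False):
    """Case I, item (ii): the smaller of the two ratios to the column mean."""
    if same_record:  # the index is defined as 0 when i == k
        return 0.0
    a_bar = mean(v for v in column if v is not None)
    return min(a_il / a_bar, a_kl / a_bar)


# Real-valued column of the three records that share the same diagnosis.
column = [12.0, 10.5, 14.0]
print(round(numeric_index_same_class(12.0, 10.5, column), 2))
print(round(numeric_index_same_class(12.0, 14.0, column), 2))
```

With these inputs the column mean is about 12.17, and the two ratios round to 0.86 and 0.99, which agrees with the index values quoted in the illustrative example.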

Case II: $A_{in} \neq A_{kn}$ 

Clearly $R_i$ and $R_k$ belong to two different decision 
classes. Consider the subsets $P_i$ and $Q_k$ consisting of 
members of $S$ that share the same decision with $R_i$ 
and $R_k$ respectively and do not have MV. Clearly 
$P_i \cap Q_k = \emptyset$. Based on the type of the attribute to which 
the column $C_l$ belongs, the following situations arise: 

(i) Elements of the column $C_l$ of $S$ are of nominal or 
categorical type: 

Following the procedure discussed in Case I item (i), 
we write $P_i$ and $Q_k$ (henceforth $P$ and $Q$) as disjoint unions 
of non-empty subsets $P_{\beta_1}, P_{\beta_2}, \ldots, P_{\beta_r}$ and 
$Q_{\delta_1}, Q_{\delta_2}, \ldots, Q_{\delta_s}$ respectively, in which 
$\beta_1, \beta_2, \ldots, \beta_r$ and $\delta_1, \delta_2, \ldots, \delta_s$ 
indicate the cardinalities of the respective subsets. We 
define the indexing measure between the two records 
$R_i$ and $R_k$ as 

$$I_{C_l}(R_i, R_k) = \begin{cases} \max\left\{\dfrac{\beta_r}{[\ldots]}, \dfrac{\delta_s}{[\ldots]}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise,} \end{cases}$$

where $\beta_r$ represents the cardinality of the subset $P_{\beta_r}$, 
all of whose elements have first co-ordinates $A_{il}$ in 
set $P$, and $\delta_s$ represents the cardinality of that subset 
$Q_{\delta_s}$, all of whose elements have first co-ordinates $A_{kl}$ 
in set $Q$. 


(ii) Elements of the column $C_l$ of $S$ are of numeric type: 
If the type of the attribute is integer we follow the 
procedure discussed in Case II item (i). For fractional 
numbers we define the index $I_{C_l}(R_i, R_k)$ between the 
two records $R_i$ and $R_k$ as 

$$I_{C_l}(R_i, R_k) = \begin{cases} \max\left\{\dfrac{A_{il}}{\bar{A}}, \dfrac{A_{kl}}{\bar{A}}\right\}, & \text{for } i \neq k; \\ 0, & \text{otherwise.} \end{cases}$$

In the above definition $\bar{A} = \min\{\bar{P}, \bar{Q}\}$, where $\bar{P}$ 
and $\bar{Q}$ denote the average of the first column entries 
of all the elements of the sets $P$ and $Q$ excluding those 
with MV in the column. 

The proximity or distance scores between the clinical 
records in the data set $S$ can be represented as 
$D = \{\{0, d_{12}, \ldots, d_{1m}\}; \{d_{21}, 0, \ldots, d_{2m}\}; \ldots; \{d_{m1}, d_{m2}, \ldots, 0\}\}$, 
where $d_{ik} = \sqrt{\sum_{l=1}^{n-1} I_{C_l}(R_i, R_k)}$. For each of the missing 
value instances in a record $R_i$ our imputation procedure 
first computes the score $z(d_{ij}) = (d_{ij} - \bar{d})/\sigma_d$, where 
$j = 1, 2, \ldots, m$, $\bar{d}$ denotes the mean distance and $\sigma_d$ the 
standard deviation of the distances. We then 
pick only those records (nearest neighbors) which satisfy 
the condition $z(d_{ij}) < 0$, where $\{d_{i1}, d_{i2}, \ldots, d_{im}\}$ denote 
the distances of the current record $R_i$ to all other records in 
the data set $S$. If the type of attribute is categorical or integer, 
then the data value that has the highest frequency (mode) 
of occurrence in the corresponding columns of the nearest 
records is imputed. For the data values of type real we impute 
the mean of the data values in the corresponding columns of the 
nearest records. 
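The neighbor selection and the final imputation step can be sketched as below. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text only states that a z-score is computed. The sign of each z-score, and hence the set of selected neighbors, does not depend on that choice.

```python
from statistics import mean, mode, stdev


def nearest_by_zscore(distances):
    """Indices of records whose distance z-score is below zero.

    `distances` holds the distances from the record being imputed to the
    other records (at least two values are needed for stdev).
    """
    d_bar = mean(distances)
    sigma = stdev(distances)  # sample standard deviation (an assumption)
    if sigma == 0:
        return []  # all distances equal: no record is nearer than average
    return [j for j, d in enumerate(distances) if (d - d_bar) / sigma < 0]


def impute(neighbor_values, kind):
    """Mode for categorical or integer attributes, mean for real ones."""
    known = [v for v in neighbor_values if v is not None]
    if kind in ("categorical", "integer"):
        return mode(known)  # on ties, Python 3.8+ returns the first mode seen
    return mean(known)


neighbors = nearest_by_zscore([0.2, 0.3, 0.9])
print(neighbors, impute(["yes", "yes", "no"], "categorical"))
```

Selecting records with a negative z-score keeps exactly those that are closer than the average distance, so the number of neighbors adapts to the data instead of being fixed in advance as in k-nearest-neighbor imputation.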

Illustrative example: The following example illustrates the 
spirit of the new imputation algorithm. Consider a data set 
represented by the matrix $S$ consisting of the rows $R_1$ = (?, 12.0, 
positive), $R_2$ = (yes, 10.5, positive), $R_3$ = (no, 14.0, positive) 
and $R_4$ = (no, 13.0, negative). The missing value instance ('?') 
in this data set is present in record $R_1$ and column $C_1$. These 
rows correspond to the data records of four individuals. Clearly 
Case I item (i) of the imputation algorithm applies to 
this data set for determining the missing value. The matrix of 
the indexing measure $I$ has the following rows: (0, 0.86) and 
(0, 0.99), in which $\gamma_{p_i} = 0$, $\gamma_{p_k} = 1$ and $\bar{A} = 12.17$. The 
relative distances between $R_1$ and the other records are computed 
as {0.93, 0, 0} and the corresponding z-scores are obtained as 
{-0.57, -0.57, 1.154}. Since $z < 0$ for the distances between 
$R_1$ and $R_2$ and also $R_1$ and $R_3$, we conclude that the records 
$R_2$ and $R_3$ are nearer to $R_1$ and hence the highest frequency 
(mode) of the data value in column $C_1$ is 'yes'. Accordingly 
this value is a suitable candidate for imputation. 

C. Identification of influential features 

In situations presented by real world processes, influential 
features are often unknown a priori; hence features that are 
redundant or weakly participating in decision 
making must be identified and appropriately handled. The 
features selection procedures can be categorized as random 
or sequential. The sequential methods, such as forward selection, 
backward elimination and bidirectional selection, employ 
greedy methods and hence may not often be successful in 
finding the optimal features subsets. In contrast, stochastic 
optimization methods such as genetic algorithms (GAs) 
perform global search and are capable of effectively exploring 
large search spaces [17]. In our approach we adopt a wrapper 
subset based feature evaluation model [18], where the method 
of classification itself is used to measure the importance of the 
features subset identified by the GA. 
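A wrapper-style genetic search over feature subsets can be sketched as follows. This is a minimal illustration rather than the authors' implementation: binary tournament selection, one-point crossover, elitism, the population size and the number of generations are all assumptions, and `fitness` stands for any user-supplied function that scores a feature mask (in a wrapper setting, the cross-validated accuracy of the classifier trained on the selected features). The default crossover and mutation probabilities are the values reported in Section V.

```python
import random


def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           crossover_prob=1.0, mutation_prob=0.001, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness(mask)`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def pick():
            # Binary tournament: the fitter of two random individuals.
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if n_features > 1 and rng.random() < crossover_prob:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Flip each bit independently with the mutation probability.
            child = [1 - g if rng.random() < mutation_prob else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


# Toy fitness (hypothetical): reward features 0 and 2, penalize subset size.
def toy_fitness(mask):
    return mask[0] + mask[2] - 0.1 * sum(mask)


print(genetic_feature_search(6, toy_fitness))
```

Because the best mask of each generation is copied into the next one, the best score never decreases from one generation to the next. In practice the fitness evaluation dominates the cost, since every call trains and tests a classifier.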


Algorithm 1 The NM Methodology 

Input: (a) Data set for the purpose of decision making S(m, n), where m and n are the 
number of records and attributes respectively, and the members of S may have MV 
in any of the attributes except the decision attribute, which is the last attribute 
in the record. 

(b) The type of the attribute C of each column in the data set. 

Output: (a) Classification accuracy for a given data set S. 

(b) Performance metrics AUC, SE, SP. 

Algorithm 

(1) Identify and collect all records in a data set S. 

(2) Impute the MV in the data set S using the procedure discussed in Section III-B. 

(3) Extract the influential features using a wrapper based approach with genetic search 
for identifying features subsets and alternating decision tree for its evaluation as 
discussed in Section III-C. 

(4) Split the data set into training and testing sets using a stratified k-fold cross 
validation procedure. Denote each training and testing data set by $T_k$ and $R_k$ 
respectively. 

(5) For each k compute the following: 

(i) Build the ADT using the records obtained from $T_k$. 

(ii) Compute the predicted probabilities (scores) for both positive and negative 
diagnosis of dengue from the ADT built in Step (5)-(i) using the test data set 
$R_k$. Designate the set consisting of all these scores by P. 

(iii) Identify and collect the actual diagnosis from the test data set $R_k$ into a set 
denoted by L. 

(6) Repeat Steps (5)-(i) to (5)-(iii) for each fold. 

(7) Obtain the performance metrics AUC, SE and SP utilizing the sets L and P. 

(8) RETURN AUC, SE, SP. 

(9) END. 
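Step (4) of Algorithm 1 can be sketched as follows. The round-robin assignment within each class and the function names are assumptions chosen for brevity; any procedure that keeps the class proportions roughly equal across folds would serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Partition record indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class to the folds in turn.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (training indices, test indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


labels = [1] * 20 + [0] * 30  # toy labels: 20 positive, 30 negative
for train, test in train_test_splits(labels, k=10):
    print(len(train), len(test), sum(labels[i] for i in test))
```

With the toy labels above, every fold receives two positive and three negative records, so each test set preserves the 2:3 class ratio of the whole data set.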


D. Predictive modeling using decision trees 

An alternating decision tree (ADT) consists of decision 
nodes (splitter nodes) and prediction nodes, which can be either 
interior nodes or leaf nodes. The tree has a prediction 
node at the root and then alternates between decision nodes 
and further prediction nodes. Decision nodes specify a predicate 
condition and prediction nodes contain a single number 
denoting the predictive value. An instance is classified by 
following all paths for which all decision nodes are true and 
summing the prediction nodes that are traversed. A 
positive sum implies membership of one class and a negative 
sum indicates membership of the opposite class. 
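The classification rule of an ADT can be sketched with two small node types. The class names, the tree layout, the thresholds and the prediction values below are all hypothetical and are chosen only to show how the score is accumulated; they are not decision rules learned from the data sets of this study.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Prediction:
    """Prediction node: a real value plus the splitter nodes hanging below it."""
    value: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Decision node: a predicate with one prediction node per outcome."""
    test: Callable[[Dict], bool]
    if_true: Prediction
    if_false: Prediction


def adt_score(node: Prediction, record: Dict) -> float:
    """Sum the prediction values on every path the record satisfies."""
    total = node.value
    for splitter in node.splitters:
        branch = splitter.if_true if splitter.test(record) else splitter.if_false
        total += adt_score(branch, record)
    return total


# Hypothetical tree: two splitters under the root, one nested splitter.
tree = Prediction(0.2, [
    Splitter(lambda r: r["platelets"] < 100,
             Prediction(0.7, [
                 Splitter(lambda r: r["fever_days"] > 3,
                          Prediction(0.3), Prediction(-0.1)),
             ]),
             Prediction(-0.5)),
    Splitter(lambda r: r["arthralgia"], Prediction(0.4), Prediction(-0.2)),
])

record = {"platelets": 80, "fever_days": 5, "arthralgia": True}
score = adt_score(tree, record)
print("positive" if score > 0 else "negative", round(score, 2))
```

Unlike an ordinary decision tree, a record can follow several branches at once (both splitters under the root are evaluated here), and the magnitude of the summed score can be read as a measure of confidence in the predicted class.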

IV. Performance evaluation methods 

The standard definitions of the performance measures such 
as the specificity (SP), sensitivity (SE), receiver operating 
characteristic (ROC) and area under the ROC curve (AUC), based on 
the number of true positives, true negatives, false positives and 
false negatives, are utilized in our experimental analysis. We 
employed a stratified k-fold cross validation for estimating 
the test error of the classification algorithms. We randomly 
divided the given data set into k disjoint subsets. Each subset 
is roughly of equal size and has the same class proportions 
as the original data set. The classification model is 
built by setting aside one of the subsets as the test data set and 
training the classifier using the remaining k - 1 subsets (nine when 
k = 10). The trained 
model is then employed in classifying the test data set. The 
experiment is repeated by setting aside each of the k subsets 
as the test data set, one at a time. To compute the ROC for k folds we 
first train a classifier using the training data set of a fold and 
then obtain the scores in terms of the predicted probabilities 
for positives and negatives from the trained classifier using the 
test data set corresponding to the same fold as the training data. 
Once all the probabilities and corresponding actual decisions 
are collected, the ROC is obtained by first computing the 
thresholds using the quartiles of the cumulative predictive 
probabilities of all the k folds. For each threshold value the 
measures SE and SP are computed. The false positive rate and 
true positive rate values of the ROC are taken as (1 - SP) and SE 
respectively. The AUC is computed by applying the trapezoidal 
rule to the data points of the ROC curve. The optimal cut-off 
or operating point is the threshold whose point on the ROC curve 
is closest to (0,1), which gives the equal error rate. The 
optimal values of AUC, SE, SP are computed for this cut-off 
point. 
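The ROC construction, the trapezoidal AUC and the choice of operating point can be sketched as follows. One simplification is assumed: every distinct score is used as a threshold, whereas the text derives the thresholds from the pooled predicted probabilities of all folds. The function names are hypothetical, labels are taken to be 1 for dengue positive and 0 for dengue negative, and both classes must be present.

```python
import math


def roc_points(labels, scores):
    """ROC points (1 - SP, SE), one per distinct score used as a threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    ordered = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(ordered, ordered[1:]))


def operating_point(points):
    """ROC point closest to the ideal corner (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[0], p[1] - 1.0))


labels = [1, 0, 1, 0]             # toy actual diagnoses (set L)
scores = [0.9, 0.8, 0.7, 0.1]     # toy predicted probabilities (set P)
pts = roc_points(labels, scores)
print(auc_trapezoid(pts), operating_point(pts))
```

For the toy data the area is 0.75, which matches the fraction of positive-negative pairs that the scores rank correctly (three out of four).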


V. Experiments and results 


In our methodology we have employed a stratified ten-fold 
cross validation (k = 10) procedure. We applied a standard 
implementation of SVM with a radial basis function kernel [19] 
using the LibSVM package. The GA for features 
selection was run with the parameter values 
crossover probability = 1.0 and mutation probability = 0.001. 
The standard implementations of the C4.5 and logistic regression (LOR) 
algorithms in Weka [20] are considered for evaluating the performance 
of our algorithm. We have implemented the NM algorithm 
and the performance evaluation methods in MATLAB. A 
non-parametric statistical test proposed by Wilcoxon [21] is used 
to compare the performances of the algorithms. We compared 
the NM with the state-of-the-art methodologies employed 
in the diagnosis of dengue fever using the different performance 
measures discussed in Section IV. 


A. Data sets 

We have obtained four surveillance data sets from case-patients 
admitted into hospitals located in the central and western 
states of India. Standard procedures were adopted in collecting 
the clinical and demographic attributes of the patients. 
The probable cases of dengue fever were identified through 
definitive laboratory tests such as ELISA. The patient records 
include the clinical symptoms fever, fever duration, headache, 
retro-orbital pain (eye pain), myalgia (body pain), arthralgia 
(joint pain), nausea or vomiting, bleeding gums, rash, bleeding 
sites, restlessness and abdominal pain, and the laboratory features 
haemoglobin (Hb), white blood cell count (WBC), packed 
cell volume (PCV) and platelets. The last attribute in each data 
set is the decision attribute. The clinical records are then 
regrouped into four data sets. The first data set (DS1) comprises 
646 adults (age > 16 years) with clinical symptoms and 
laboratory features, of whom 256 were dengue positive 
and 390 were dengue negative. The second data set (DS2) is 
a part of DS1 consisting of only the clinical symptoms (ignoring 
the laboratory features) and has the same number of records 
as DS1. The third data set (DS3) consists of 398 children 
(age between 5 and 15 years) with clinical symptoms and 
laboratory features, of whom 93 were dengue positive and 
305 were dengue negative. The fourth data set (DS4) is a part 
of DS3 with only the clinical symptoms and has the same number of 
records as DS3. 

TABLE I Performance comparison of the NM with other 
methodologies (C4.5, SVM and LOR) on the data sets used 
in the present study 

Dataset  Method  Accuracy (%)  SE (%)  SP (%)  AUC
DS1      NM      100.00        100.00  100.00  1.00
DS1      C4.5    96.44         95.90   97.27   1.00
DS1      LOR     91.02         89.49   93.36   0.96
DS1      SVM     96.75         97.18   96.09   0.97
DS2      NM      86.53         88.97   82.81   0.93
DS2      C4.5    82.35         87.18   75.00   0.84
DS2      LOR     72.91         74.36   70.70   0.78
DS2      SVM     78.17         89.49   60.94   0.75
DS3      NM      100.00        100.00  100.00  1.00
DS3      C4.5    94.97         95.41   93.55   0.99
DS3      LOR     92.71         92.79   92.47   0.96
DS3      SVM     98.99         98.69   100.00  0.99
DS4      NM      95.48         98.03   87.10   0.95
DS4      C4.5    90.20         91.48   86.02   0.91
DS4      LOR     88.44         89.84   83.87   0.90
DS4      SVM     92.71         98.03   75.27   0.87


B. Results 


The performance of the NM is compared with other methodologies 
(C4.5, SVM and LOR) on the data sets used in the 
present study, and the classification accuracies are presented 
in Table I. An accuracy of one hundred percent is reported by NM 
on both data sets DS1 and DS3. The Wilcoxon matched-pairs 
rank sum test results comparing the accuracies of NM with 
other methodologies are shown in Table II. For example, the 
positive rank sum of 55.0 and negative rank sum of 0.0 with a 
p-value < 0.01 for C4.5 using data set DS1 (first row of Table II) 
indicate the superior performance of the new methodology 
over C4.5, and similarly in respect of the other methods. 
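The positive and negative rank sums of the matched-pairs test can be computed as in the sketch below. The function name and the toy accuracies are hypothetical; the conventions assumed are the usual ones for the Wilcoxon signed-rank test, namely that zero differences are discarded and tied absolute differences share their average rank.

```python
def wilcoxon_rank_sums(a, b):
    """Positive and negative rank sums for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences (ties).
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    positive = sum(r for r, d in zip(ranks, ordered) if d > 0)
    negative = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return positive, negative


# Toy per-fold accuracies (hypothetical): method A beats method B in all folds.
acc_a = [100] * 10
acc_b = [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
print(wilcoxon_rank_sums(acc_a, acc_b))
```

With ten folds and one method better in every fold, the positive rank sum is 1 + 2 + ... + 10 = 55 and the negative rank sum is 0, which is the pattern reported for several comparisons. A positive rank sum below 55 with a zero negative rank sum indicates that some folds produced identical accuracies and were dropped.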


TABLE III Influential features subsets identified by NM 

Data set  # Original features  # Influential features  Accuracy (%)  Features identified
DS1       16                   5                       100.00        retro-orbital pain, arthralgia, fever duration, platelet, fever
DS2       9                    6                       86.53         vomiting or nausea, myalgia, rash, bleeding sites, abdominal pain, arthralgia
DS3       16                   2                       100.00        Hb, fever
DS4       9                    2                       95.48         retro-orbital pain, arthralgia


TABLE II Wilcoxon matched-pairs rank sum test for comparing 
the performance of NM with other methodologies used in 
diagnosis of dengue fever 

Dataset  Method  Rank sums (+, -)  p-value
DS1      C4.5    55.0, 0.0         0.002
DS1      LOR     55.0, 0.0         0.002
DS1      SVM     45.0, 0.0         0.004
DS2      C4.5    55.0, 0.0         0.002
DS2      LOR     55.0, 0.0         0.002
DS2      SVM     55.0, 0.0         0.002
DS3      C4.5    36.0, 0.0         0.008
DS3      LOR     36.0, 0.0         0.008
DS3      SVM     10.0, 0.0         0.125
DS4      C4.5    38.5, 6.5         0.074
DS4      LOR     37.0, 8.0         0.098
DS4      SVM     27.0, 9.0         0.25


The above comparisons and statistical tests clearly demonstrate 
the significance of our methodology in identifying 
suspected dengue both in children and adults. The imputation 
strategy employed in our methodology has improved the 
classification accuracies when compared with C4.5, which uses 
a modified information gain measure to generate the decision 
tree in the presence of MV. The mean imputation strategies 
adopted in SVM and LOR could not render classification 
accuracies higher than NM. 

The features subsets identified by the NM are shown in 
Table III. The application of the features selection method reduced 
the number of attributes by 75% in DS1 and 87.5% in DS3 
data sets. Our methodology identified some clinical 
symptoms and laboratory features in adults (vomiting and 
abdominal pain) different from those in children, which is 
in concurrence with earlier studies [22], [23]. The clinical 
attribute rash was identified as an important feature in adults 
but not in children. This may be explained by the relative 
frequency of secondary infections in adults [24]. Arthralgia 
was found to influence the final diagnosis of dengue 
both in children and adults. The ROC curves comparing the 
performance of NM with other methodologies are shown in 
Fig. 1. The operating point or cut-off point (p < 0.001) 
is shown as a pentagon on each of the ROC curves. The ROC 
curves clearly demonstrate the superior performance of NM 
over other methods used in the diagnosis of dengue. 


Fig. 1. ROC curves 


VI. Performance comparison of new imputation algorithm with benchmarking data sets 

Since there are no specific studies on the imputation of missing values 
in dengue data sets, we have utilized some benchmarking 
data sets obtained from the KEEL and University of California 
Irvine (UCI) machine learning data repositories [25], [26] to 
test the performance of the new imputation algorithm. The 
Wilcoxon statistics in Table IV are computed by comparing the 
accuracies obtained by the new imputation algorithm with 
the accuracies obtained by other imputation algorithms 
using a C4.5 decision tree. The results in Table IV clearly 
demonstrate that our algorithm is superior to the other 
imputation algorithms, as the positive rank sums are higher 
than the negative rank sums (p < 0.05) in all the cases. 


TABLE IV Wilcoxon signed-rank statistics for matched pairs 
comparing the new imputation algorithm with other imputation 
methods using C4.5 decision tree 

Method  Rank sums (+, -)  Test statistic  Critical value  p-value
FKMI    78.5, 12.5        12.5            18              0.021
KMI     85.0, 6.0         6               18              0.003
KNNI    76.0, 15.0        15              18              0.032
WKNNI   83.0, 8.0         8               18              0.006


VII. Computational complexity 

The computational complexity is a measure of the performance 
of the algorithm. For each data set having $n$ attributes 
and $m$ records, we select only the subset of $m_1 < m$ records 
in which missing values are present. The distances are 
computed over all $n$ attributes excluding the decision attribute. 
So, the time complexity for computing the distance would be 
$O(m_1 (n-1))$. The time complexity for selecting the nearest 
records is of order $O(m_1)$. For computing the frequency of 
occurrences for nominal attributes and the average for numeric 
attributes the time taken would be of the order $O(m_1)$. 
Therefore, for a given data set with $k$-fold cross validation 
having $n$ attributes and $m$ records, the time complexity of 
our new imputation algorithm would be 
$k \, (O(m_1 (n-1) m) + 2\,O(m_1))$, which is asymptotically linear. Our 
experiments were conducted on a personal computer having an 
Intel(R) Core(TM) 2 Duo CPU @ 2.93 GHz processor with 
4 GB RAM. For each data set the computational time for 
imputation and features selection is measured in terms of the 
CPU time elapsed in seconds. Based on the 
results, we obtain a scatter plot (red line in Fig. 2) of 
the time taken by NM against the varying database sizes. Also, 
we employed a linear regression on our results and obtained 
the relation between the time taken ($T$) and the data size ($D$) 
as $T = [\ldots]\,D + 5.54$, $\alpha = 0.05$, $p < 0.05$, $R^2 = 0.98$. The 
presence of the linear trend between the time taken and the 
varying database sizes ensures the numerical scalability of the 
performance of NM in terms of asymptotic linearity. 

Fig. 2. Computational complexity of the NM 
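A linear trend between running time and data size can be checked with an ordinary least-squares fit such as the sketch below. The function name and the measurements are hypothetical and serve only to show the computation; they are not the timings obtained in the experiments.

```python
def linear_fit(sizes, times):
    """Least-squares line T = slope * D + intercept and its R-squared."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: 1 - residual / total sum of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(sizes, times))
    ss_tot = sum((y - mean_y) ** 2 for y in times)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical measurements: data size (records) and elapsed time (seconds).
sizes = [100, 200, 300, 400, 500]
times = [9.8, 14.1, 18.2, 21.9, 26.0]
slope, intercept, r_squared = linear_fit(sizes, times)
print(round(slope, 4), round(intercept, 2), round(r_squared, 3))
```

An R-squared value close to 1 indicates that a straight line explains most of the variation in the measured times, which is the kind of evidence used above to support the claim of linear scaling.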


VIII. Comparison of related methodologies on dengue studies 

In this section we compare the results (Table V) obtained 
in [8]-[10] with the results of our new methodology on the 
current data set of 1044 individuals including children and 
adults. As compared to [9], where children with rash 
had an SE of 41.2% and SP of 95.5%, our methodology when 
applied to the data set DS2 resulted in an accuracy of 86.53%, 
SE of 88.97% and SP of 82.81%, which is considered to be 
a good classification model as both SE and SP are higher 
than 80%. In [10] both clinical and laboratory features were 
utilized to develop decision rules using a C4.5 decision tree, 
and an SE of 87.8% and SP of 75.7% were reported. In 
comparison to [10], our methodology when applied to DS1 and 
DS3 resulted in an SE of 100% and SP of 100%. From these 
comparisons we conclude that the new methodology presented 
in this study, if applied to the data sets used in [8]-[10], would 
yield more accurate results. 


TABLE V Evaluation of NM with other related methodologies 
on dengue studies 

State-of-the-art                                 #Patients (DF)  Records with MV          Methods  Accuracy (%)  SE (%)  SP (%)
Chadwick et al. [8] (clinical)                   381 (148)       deleted                  LOR, OR  84.5          84      85
Chadwick et al. [8] (laboratory)                 381 (148)       deleted                  LOR, OR  76.5          74      79
Ramos et al. [9] (clinical, children)            938 (38)        manual update            LOR, OR  68.95         41.2    95.5
Tanner et al. [10] (laboratory)                  1200 (173)      deleted                  C4.5     81.75         87.8    75.7
Gomes et al. [11] (gene database)                20 (15)         -                        SVM      85            -       -
NM (DS1) (adults, clinical & laboratory)         1044 (256)      imputed (new algorithm)  ADT, GA  100           100     100
NM (DS2) (adults, clinical)                      1044 (256)      imputed (new algorithm)  ADT, GA  86.53         88.97   82.81
NM (DS3) (children, clinical & laboratory)       1044 (93)       imputed (new algorithm)  ADT, GA  100           100     100
NM (DS4) (children, clinical)                    1044 (305)      imputed (new algorithm)  ADT, GA  95.48         98.03   87.10


IX. Conclusions and discussion 

A new methodology (NM) with built-in features for imputation 
of missing values and identification of influential 
attributes is discussed. The NM has outperformed the 
state-of-the-art methodologies in diagnosis of dengue fever on all 
the four data sets considered in our experiments. The NM 
has generated a decision tree with an accuracy of 100.0% in 
children and adults using both clinical and laboratory features. 
Based on the performance measures we conclude that the use 
of the new imputation strategy and features selection methods 
with wrapper based subset evaluation using genetic search has 
improved the accuracies of the predictions. Though the new 
methodology discussed in this paper may be taken as a universal 
tool for the effective diagnosis of this disease, it remains 
to be seen whether or not this methodology is geographically 
independent. However, we are willing to share our predictive 
methodologies and strategies with the researchers working on 
dengue fever all over the globe. We hold the view that more 
intensive and introspective studies of this kind will pave the way 
for better clinical management and virological surveillance of 
dengue fever. 